This section is dedicated to the preparation of the dataframes used in the subsequent sections. We present here what is actually performed in CSVtoSQLconverter.py, where the CSV files are read into Pandas dataframes, prepared, and finally saved to a SQL database. This notebook walks through the preparation using the 2014 data.
The first CSV file contains the overall characteristics of each recorded accident, such as the location, the luminosity, the type of road, and the precise date and time of the accident.
In [1]:
import pandas as pd
import numpy as np
from pandas import DataFrame,Series
%matplotlib inline
In [2]:
caract_df = pd.read_csv("data/2014_France_caracteristiques.csv",sep=',', encoding='latin-1')
In [3]:
caract_df.head()
Out[3]:
In [4]:
import datetime as dt
The time of the accident is stored as an integer in 24H format, such that 5:20pm becomes 1720. However, some of the values are 1, 15, or 29. One possible explanation is that the recorded number was "029" for 29 minutes past midnight, but once converted to an integer it became 29. We can verify this by plotting the distributions of values for hrmn < 100 (the hour after midnight) and hrmn >= 2300 (the hour before midnight) and comparing them.
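As a quick illustration of this hypothesis, zero-padding such a value back to four characters recovers a plausible time (a minimal check, not part of the original script):

# an 'hrmn' of 29 padded to 4 digits reads as 00:29
"%04d" % 29   # -> '0029'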
In [5]:
# The hour after midnight
caract_df[caract_df['hrmn'] < 100]['hrmn'].plot.hist(bins=range(1,60),figsize=(12,6))
Out[5]:
In [6]:
# The hour before midnight
caract_df[caract_df['hrmn'] >= 2300]['hrmn'].plot.hist(bins=range(2300,2360),figsize=(12,6))
Out[6]:
We can now confidently convert the integers below 100 to 4-character strings padded with leading zeros.
In [7]:
def columns_to_datetime(sr):
    # Zero-pad 'hrmn' to 4 digits so that, e.g., 29 is parsed as 00:29
    stringall = "%d-%d-%d %04d" % (sr['an'], sr['mois'], sr['jour'], sr['hrmn'])
    sr['datetime'] = dt.datetime.strptime(stringall, "%y-%m-%d %H%M")
    return sr
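As an aside, the same conversion can be done without a row-wise apply. A vectorized sketch, not what CSVtoSQLconverter.py actually does, and assuming 'hrmn' has no missing values:

datestr = (caract_df['an'].astype(str).str.zfill(2) + '-'
           + caract_df['mois'].astype(str).str.zfill(2) + '-'
           + caract_df['jour'].astype(str).str.zfill(2) + ' '
           + caract_df['hrmn'].astype(str).str.zfill(4))
caract_df['datetime'] = pd.to_datetime(datestr, format='%y-%m-%d %H%M')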
In [8]:
clean_df = caract_df.apply(columns_to_datetime,axis=1)
In [9]:
clean_df.head()
Out[9]:
In [10]:
# Let's check that our conversion worked properly
clean_df[clean_df['hrmn'] == 29]
Out[10]:
In [11]:
# We can remove the original time and date columns
clean_df.drop(['an','mois','jour','hrmn'],axis=1, inplace=True)
clean_df.head()
Out[11]:
In [12]:
clean_df.isnull().sum()
Out[12]:
In [13]:
clean_df.shape
Out[13]:
Half of the longitude, latitude and GPS values are missing. Many of the addresses are also missing, and since we have no use for a very precise location we can drop these columns. We still have the area code ("dep") for a rough location within France.
In [14]:
clean_df.drop(['adr','gps','lat','long'],axis=1, inplace=True)
clean_df.head()
Out[14]:
In [15]:
clean_df.rename(columns={'Num_Acc':'accident id','lum':'luminosity','agg':'in city','int':'intersect type',
'atm':'weather','col':'collision type','com':'city id','dep':'area id'},
inplace=True)
clean_df.head()
Out[15]:
For some reason most area IDs have a trailing 0, except for the overseas areas 971, 972, 973, 974, and 976. Corsica is split into the areas 2A and 2B, which are labelled 201 and 202 in this dataframe; we will keep 201 and 202 so that the column stays filled with integers. We will therefore divide by 10 only the values that are multiples of 10, as illustrated below.
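For example, assuming Paris (75) is recorded as 750, the modulo test separates the two cases:

# 750 is a multiple of 10 and becomes 75; 971 (Guadeloupe) is not and is kept
750 % 10 == 0, 971 % 10 == 0   # -> (True, False)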
In [16]:
clean_df['area id'].unique()
Out[16]:
Let's correct those for clarity's sake.
In [17]:
def correct_area_id(sr):
    # Strip the extra trailing 0 from the mainland area IDs
    if (sr['area id'] % 10) == 0:
        sr['area id'] = int(sr['area id'] / 10)
    return sr

clean_df = clean_df.apply(correct_area_id, axis=1)
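The same correction can also be written as a single vectorized assignment; a sketch offered as an alternative to the row-wise apply:

mask = (clean_df['area id'] % 10) == 0
clean_df.loc[mask, 'area id'] = clean_df.loc[mask, 'area id'] // 10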
In [18]:
clean_df['area id'].unique()
Out[18]:
In [19]:
## This is done in the CSVtoSQLconverter.py script
# clean_df.to_sql(name='characteristics', con=sqlEngine, if_exists = 'replace', index=False)
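The sqlEngine referenced above is created in CSVtoSQLconverter.py. A minimal sketch of such an engine, assuming a local SQLite file; the actual connection string in the script may differ:

from sqlalchemy import create_engine

# Hypothetical connection string; the real script may target another database
sqlEngine = create_engine('sqlite:///road_accidents.db')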
In [20]:
lieux_df = pd.read_csv("data/2014_France_lieux.csv",sep=',', encoding='latin-1')
In [21]:
lieux_df.head()
Out[21]:
In [22]:
lieux_df.isnull().sum()
Out[22]:
In [23]:
lieux_df.shape
Out[23]:
Some of the most unreliable columns are fortunately also not very useful, since they pertain to the address of the accident, which we have already decided to drop.
In [24]:
clean_df = lieux_df.drop(['voie','v1','v2','pr','pr1'],axis=1)
clean_df.head()
Out[24]:
In [25]:
clean_df.rename(columns={'Num_Acc':'accident id','catr':'road type','circ':'traffic mode',
'nbv':'nb lanes','vosp':'reserved lane','prof':'road profil',
'plan':'road alignment','lartpc':'central reservation',
'larrout':'road width','surf':'road surface','infra':'installations',
'situ':'location','env1':'school distance'},
inplace=True)
clean_df.head()
Out[25]:
In [26]:
## This is done in the CSVtoSQLconverter.py script
# clean_df.to_sql(name='locations', con=sqlEngine, if_exists = 'replace', index=False)
In [27]:
vehicules_df = pd.read_csv("data/2014_France_vehicules.csv",sep=',', encoding='latin-1')
In [28]:
vehicules_df.head()
Out[28]:
In [29]:
vehicules_df['senc'].value_counts().head()
Out[29]:
In [30]:
# The 'senc' column adds little useful information (see the value counts above), so we drop it
clean_df = vehicules_df.drop('senc',axis=1)
In [31]:
clean_df.head()
Out[31]:
In [32]:
clean_df['catv'].value_counts().sort_index()
Out[32]:
It is more convenient to reclassify the types using categories from 1 to 24:
In [33]:
# Map the original category codes to consecutive integers starting at 1
newMap = {}
for i, cat in enumerate(clean_df['catv'].value_counts().sort_index().index):
    newMap[cat] = i + 1
clean_df['catv'] = clean_df['catv'].map(newMap)
clean_df['catv'].value_counts().sort_index()
Out[33]:
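Since newMap was built from every distinct value of 'catv', the remapping should leave no gaps. A quick sanity check, not in the original script and assuming 'catv' had no missing values to begin with:

# map() returns NaN for unmapped codes, so the absence of NaNs means full coverage
assert clean_df['catv'].notnull().all()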
In [34]:
clean_df.rename(columns={'Num_Acc':'accident id', 'catv':'vehicle type','obs':'fixed obj hit',
'obsm':'moving obj hit', 'choc':'impact location','manv':'maneuver',
'occutc':'nb occupants public transit', 'num_veh':'vehicle id'},inplace=True)
In [35]:
clean_df.head()
Out[35]:
In [36]:
## This is done in the CSVtoSQLconverter.py script
# clean_df.to_sql(name='vehicles', con=sqlEngine, if_exists = 'replace', index=False)
This final dataframe contains the information relative to the people, or road users, involved in the accident. It holds personal information on the users, such as their birth year and sex, as well as information on, for example, the safety gear worn and the location and direction of any pedestrians hit.
In [37]:
usagers_df = pd.read_csv("data/2014_France_usagers.csv",sep=',', encoding='latin-1')
In [38]:
usagers_df.head()
Out[38]:
We will keep all the columns of this dataframe since they all contain some interesting data.
The 'secu' and 'an_nais' columns represent, respectively, the safety gear used (or not used) and the birth year of each user. The 'secu' column is not very practical as it is, since it stores the type of safety gear in the first digit of an integer and whether the gear was used or not in the second digit: a value of 21, for example, means gear type 2 with usage flag 1. It will be simpler to split these two items into two columns.
In [39]:
usagers_df['secu'].value_counts(dropna=False)
Out[39]:
We should always have the first digit, since the value 9 represents 'other' for non-defined types.
In [40]:
def safety_to_twocolumns(sr):
    # Split the two-digit 'secu' code: first digit = gear type, second = worn or not
    if not np.isnan(sr['secu']):
        safetychar = "%2d" % (sr['secu'])
        if safetychar[0] != ' ':
            sr['safety gear type'] = int(safetychar[0])
            sr['safety gear worn'] = int(safetychar[1])
    return sr
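Equivalently, the two digits can be split arithmetically in a vectorized way. A sketch, noting that the results stay floats because 'secu' contains missing values, and that it differs from the function above for one-digit codes (which it would split as gear type 0):

# NaN propagates through divmod, so a missing 'secu' stays missing in both outputs
gear_type, gear_worn = divmod(usagers_df['secu'], 10)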
In [41]:
clean_df = usagers_df.apply(safety_to_twocolumns,axis=1)
In [42]:
clean_df.head()
Out[42]:
In [43]:
clean_df.drop('secu',inplace=True,axis=1)
clean_df.head()
Out[43]:
The birth year is not that interesting by itself; the age of the user at the time of the accident is more informative. We can replace the birth year with the age.
In [44]:
clean_df['age'] = 2014 - clean_df['an_nais']
In [45]:
clean_df.drop('an_nais',inplace=True,axis=1)
clean_df.head()
Out[45]:
In [46]:
clean_df.rename(columns={'Num_Acc':'accident id', 'num_veh':'vehicle id',
'place':'location in vehicle','catu':'user type',
'grav':'severity','sexe':'sex','trajet':'journey type',
'locp':'pedestrian location','actp':'pedestrian action',
'etatp':'pedestrian company'},inplace=True)
In [47]:
clean_df.head()
Out[47]:
In [48]:
## This is done in the CSVtoSQLconverter.py script
# clean_df.to_sql(name='users', con=sqlEngine, if_exists = 'replace', index=False)